System of finite state machines

ABSTRACT

A system of finite state machines built with asynchronous or synchronous logic for controlling the flow of data through computational logic circuits programmed to accomplish a task specified by a user, having one finite state machine associated with each computational logic circuit, having each finite state machine accept data from either one or more predecessor finite state machines or from one or more sources outside the system and furnish data to one or more successor finite state machines or a recipient outside the system, excluding from consideration in determining a clock period for the system logic paths performing the task specified by the user, and providing a means for ensuring that each finite state machine allows sufficient time to elapse for the computational logic circuit associated with that finite state machine to perform its task.

TECHNICAL FIELD

This paper reports on the design of the Trebuchet, a pseudo-asynchronous micropipeline. It grows out of an earlier effort we called SMAL [4]. SMAL was a system that, like the Trebuchet, was compiled from Java to hardware. Its execution engine was basically a massive synchronously clocked pipeline network that became unwieldy with any but the smallest pieces of software.

The Trebuchet, in contrast, targets a linear pipeline composed of many interacting state machines. (See FIG. 1.) Each state machine, while entirely synchronous, behaves asynchronously in the sense that stage latency depends on the complexity of the computation performed rather than on the clock period. The clock period, rather than being dependent on the longest logic chain, is set by the time required to exchange handshaking signals with immediate neighbors.

On the global scale, the Trebuchet resembles Sutherland's asynchronous micropipelines [27], inheriting many of their behavioral characteristics despite being completely synchronous in the details. Hence we describe the Trebuchet as Pseudo-Asynchronous. We claim that pseudo-asynchronism allows the flexibility required to implement software as a cohesive hardware machine, and that the benefits normally ascribed to asynchronous machinery may be achievable [6]. Specifically, like conventional asynchronous circuitry, throughput should depend on the average stage latency rather than the longest logic chain. Likewise, current draw should be smoothed, with attendant reductions in radiated EMI.

The thrust of our research has been to develop a methodology for seamless implementation of hardware and software functionality. It has applications in both embedded systems and reconfigurable computing engines. The former is concerned with dividing functionality between software and hardware (generally application specific integrated circuits (ASICs)), whereas the latter seeks to off-load time-critical functions to temporary circuits configured into field programmable gate arrays (FPGAs).

In contrast to conventional projects, where software is prepared in an environment of existing and stable hardware, embedded computer systems typically require parallel development of hardware and software components. Because the respective disciplines of hardware and software development are commonly conceived of as quite different, early decisions about task allocation are made which have profound and irreversible consequences on the ultimate cost and performance of the system. Consequently, much of the research in embedded system technology is devoted to blending the development methodologies by deriving both hardware and software from high-level descriptions, so that decisions can be delayed as long as possible and are demonstrably correct when made [1].

Because conventional hardware design is tedious and foreign to most software practitioners, generating hardware from software has been an important research goal [22]. Generally speaking, these approaches target reconfigurable machinery hosted on FPGAs. However, reconfigurable co-processors tend to execute their functions rapidly and then remain idle for long periods of time. Consequently, the question of how to dynamically reconfigure the co-processor has also become important [8][16][28].

Our overall goal has been to develop a methodology in which the use of CPUs, reconfigurable logic implemented in FPGAs, and permanently configured logic in an ASIC (application specific integrated circuit) are all parameters in a scheduling problem [4]. But because available chip real estate allows implementation of only small portions of a program in hardware, it seems desirable to develop an approach that deals with multiple chips and the consequences of chip and board boundaries.

We seek a methodology that automatically allocates portions of a program to a network of execution resources based on single-threaded software execution profiles. A methodology that constructs systems using both hardware and software can take best advantage of the available execution resources. Code written in a high level language (e.g., Java) may execute as machine instructions or directly as hardware. The decision should depend on the needs of the work to be done and the resources available to do it.

We chose Java as the application language because the Java Virtual Machine (JVM) has a simple and regular addressing scheme without registers, and because the interpreter makes it easy to gather execution statistics that may be used in mapping experiments. Ours and other research at our institution indicated that Java execution is predictable enough that transformation of portions of an application to hardware is possible [24]. Fleischmann and Buchenrieder are also using Java to study reconfigurable hardware systems, but do not generate hardware automatically, as our system does [8]. Hutchings et al. are doing low level hardware design with Java-based JHDL [15], but their tool is not aimed at high performance pipeline systems.

This invention relates to the organization and structure of computing machinery residing in a FPGA or ASIC device. Application programs are typically compiled from source code into a stream of instructions that control successive actions by a CPU (Central Processing Unit). However, high performance machinery can be directly constructed to perform the intended computation without recourse to a CPU. This is typically done in signal processing and other applications with large throughput requirements and significant low-level parallelism.

BACKGROUND ART

There are three main bodies of work that bear on this problem. Task partitioning has been traditionally applied to large scale multiprocessing, but the issues are the same: how to parcel out work to computational resources. Systolic processing is a technique for overlapping computations at a fine granularity. Reconfigurable computing aims to off-load processing to temporarily configured hardware.

Systolic Processing

Systolic processing arrays are characterized by regular arrays of processors fixed in place with data streaming through them. Considerable speed can be achieved due to the high degree of pipeline-ability. Most, though not all, systolic processing is performed on digital signal processing (DSP) applications.

Kung proposed this mode of computing in [19] as a straightforward mapping of signal flow graphs onto hardware. By performing several operations on a data item before returning it to memory, throughput of compute-bound programs could be greatly increased. Kung provided a semi-automatic method of transforming a data flow graph into a systolic array configuration. He noted that memory bandwidth is likely to remain a bottleneck even after systolization. Systolic arrays as conceived by Kung were dedicated hardware devices.

Systolic processing has grown with the work of many researchers. Johnson, et al. [18] surveyed the state of the art in 1993. They found that most of the work had shifted away from dedicated processors toward reconfigurable hardware. Programming was typically done by schematic entry or in hardware oriented languages such as VHDL, and most implementations relied on Field Programmable Gate Arrays (FPGAs). They identified the low pin-out of FPGAs as a major limiting factor: the bottleneck in processing rate was communication with the FPGA. They noted that technology limited designs to static configurations and identified automatic array synthesis as an important area to pursue. Since publication of Johnson's survey, both of these have been actively researched (see [1] and [18]).

Reconfigurable Computing

Research on configurable computing engines is hampered by the inability to compare results. In a recent article discussing the needs of the community, a committee stated that it is difficult to decide whether differences in performance reported by investigators are due to architectural consequences or individual skill at circuit design [22]. They felt that a methodology for describing reconfigurable architecture and assessing performance would be of great value, especially if it subsumes the differences of fine-grained commercial devices and ‘chunky’ approaches. Unfortunately, the latter was assessed as unlikely until more experience with reconfigurable machines is acquired.

Athanas and Silverman developed the PRISM (Processor Reconfiguration through Instruction Set Metamorphosis) as a more flexible alternative to special purpose machines [1]; this work was done as Athanas' Ph.D. work under Silverman. They noted that dramatic speedups can be had by implementing the most compute-bound portions of a program in hardware. They sought to replace dedicated hardware co-processor units with FPGAs and a high-level language interface.

They point out that communication bandwidth between the CPU and the co-processor is critical to the success of the technique. Toward this end they sought to improve bus access of the co-processor so that transfers would be less expensive. They do not, however, systematize the allocation of program parts to execution domains. In their model, execution occurs in two distinct modes: conventional style with opcodes in the CPU, and hardware implemented in the co-processor. They then attempt to manage the bottlenecks between the two. They also point out that certain portions of programs give greater benefit when assigned to the co-processor. There is no attempt to select the portions of code that yield the most benefit by assignment; they depend instead on the programmer's knowledge of where the code spends the most time.

Athanas and Silverman felt that important directions for research included development of special purpose FPGAs that provide better support for architectural features. These include shadow configurations to support rapid switching between configurations, faster configuration downloads, and support for context-switching and resource sharing between time-shared tasks. This has merit, but certain critical deficiencies hamper the success of the work. They are:

-   Applications that need special purpose hardware are unlikely prospects for timesharing. If throughput cannot be adequately provided by general-purpose platforms, timesharing would be the first convenience to give up. Hence techniques to share execution assist hardware (co-processors) are not needed, at least within executing programs. (Sharing resources between runs, on the other hand, is a different story, and the whole reason for configurability.)
-   Applications that process steady streams of data will be slowed by switching the critical execution resource between competing portions of the code. As with competition between time-shared tasks, if the need for speed warrants special purpose hardware (even if reconfigurable), then that hardware should spend its entire time doing one particular thing. If the need permits switching the resource out, then fast hardware that has been optimized for sharing (i.e. a CPU) should be used.
-   There is no governing theory that allows coherent decisions to be made about the relative merit of configuring one portion of code versus a different portion in the high-speed assist unit.

Wirthlin and Hutchings are attempting to increase what they call the functional density of circuits implementing software [28]. They have examined the possible gains that would accrue to systems with improved reconfiguration times and characterized them against the length of the calculation. Not surprisingly, they conclude that the shorter the calculation, the more sensitive it is to configuration latency. They have also attempted to improve functional density by preparing specialized operators; e.g., multiplication by a constant would employ circuitry that does the specific multiplication rather than using a general multiplication circuit and a constant. This results in both smaller and faster circuits.

Maya Gokhale built the Splash, a reconfigurable processor used by many researchers; in [9] she details the architecture. It consists of a linear sequence of thirty-two FPGAs which function as configurable pipeline stages. The FPGAs, Xilinx 3090 chips, are programmed in VHDL. The whole assembly communicates with a host processor across a VME bus.

Gokhale reported a speedup of 330 over Cray-2 performance in spite of the fact that the Splash is severely I/O limited. She speculates that many applications could achieve an additional ten-fold speed-up if the I/O bottleneck were removed. There is no concept of hierarchy associated with Splash, and the only accommodation for the disparity between the processor bus speed and Splash's processing rate is an eight megabyte staging memory.

While Splash is important as a research tool for configurable computing, the following criticisms apply:

-   The lack of hierarchy means that there can be no accommodation of lower bandwidth input and output streams.
-   Routine programming in VHDL is tedious.
-   The logic packages are small (Xilinx 3090s) and the pin connectivity is still more limiting.

C. A. R. Hoare and his laboratory are working on configurable computing. In [13] he details his process for compiling high level code (in this case occam) to hardware. His emphasis is on generating correct translations by the use of ‘correctness-preserving’ transformations. The translation is based on a state machine which activates each operation sequentially. His example showed a small program that was efficiently translated to hardware. In practice, only small chunks of code could be configured because the HARP board he used accommodated only one FPGA.

Hwang in [17] explores a concept that he calls pipenets. This is a generalization of vector processing where arrays are streamed through a sequence of cascaded operations. The implementation that he proposed was a sequence of operators connected by cross-bar switches. Pipenets were limited to processing of arrays and had no hierarchy. No actual implementation was reported.

Yen and Wolf explored the problem of dividing an acyclic task graph between available processors, which may be either one of several types of CPU or an ASIC. They iteratively explored the alternative configurations, accounting for communication and processing time. They accounted for the cost of sharing communication and CPU resources, but did not allow for shared ASIC hardware resources. Tasks were required to be completely resident on either a particular CPU or ASIC, as there was no treatment of hierarchical networks. There was no treatment of reconfiguration delays because ASIC resources were not shared.

Chiodo et al. proposed a uniform execution model for hardware and software hosted execution called Co-Design Finite State Machines (CFSM). Execution is carried out by communicating finite state machines which may reside in either hardware or software. C code could be used to generate either hardware or software, but there was no automatic partitioning [5].

Peng and Shin [25] use a least common multiple (LCM) approach to partitioning a task load among a set of processors. The idea is that scheduling is easier if the total load can be treated as a single non-repeating task. To that end, a super-task is created by replicating the task executions until they all end together. The length of the super-task is then the LCM of all of the task periods. For scheduling purposes, the super-task can be treated as if it were non-periodic because there are no side effects that propagate into the next super-cycle.

The LCM approach is hard to apply in practice for the following reasons:

-   The periods of most tasks are many hundreds or thousands of CPU cycles. If the task periods are relatively prime, the length of the planning cycle becomes prohibitive [29].
-   If the period is bounded but not constant (e.g. engine speed), the priority scheme must be validated for each possible period [29].
-   The events triggering separate tasks must be synchronized in order to retain validity of the schedules derived.

Peng and Shin explored an interesting branch and bound algorithm to speed discovery of the best task partition and schedule. They allocate tasks to processors and note the system hazard (the task latency divided by the available latency). They get a lower bound on the ultimate system hazard by using the load imposed by allocated tasks and an approximation of the load that unallocated tasks will eventually impose. The approximation is not exact because it neglects the contention that unallocated tasks will cause each other, but it is a valid lower bound on that load. The lowest cost alternative is chosen for expansion until completely expanded configurations with all tasks allocated are reached. Because completely allocated configurations have exact costs, they enable pruning of unexpanded alternatives that have inferior cost bounds. When such a configuration has a cost lower than all other alternatives, it is optimal.

Peng and Shin correctly claim polynomial time complexity for the bounding and pruning operation, but this does not imply that the whole algorithm has polynomial time complexity. There is no argument that sufficient branches are pruned to guarantee that a polynomially bounded number of nodes will be investigated. Their experimental data indicate, however, that average performance is quite good.

Ptolemy [2] is a C++ system that relies on object oriented programming and class inheritance to provide a uniform programming interface for synthesizing hardware or conventional software on networked CPUs. While Ptolemy is a powerful programming tool, the programmer decides what code should become hardware or software and on what machine it should run.

COSYMA was developed at the University of Braunschweig as a vehicle for experimenting with hardware and software partitioning algorithms [7]. Code is compiled from a C-related language called C^(α). The output of the compiler is an acyclic graph of basic blocks which are allocated to a single CPU or ASIC. Communication is via memory, the processor halting when control transfers to the ASIC. Allocation to hardware or software is determined by simulated annealing with a cost function based on instruction timing, communication overhead, and hardware performance. COSYMA has the following limitations [29]:

-   The architecture accommodates only one CPU and one ASIC.
-   Hardware and software components may not run concurrently.
-   The performance assessment algorithm used cannot handle periodic and concurrent tasks.
-   Simulation-based timing may not be accurate enough for hard timing constraints.
-   The simulated annealing cost function did not account for hardware costs.

Lehoczky and Sha researched the application of real-time scheduling techniques to bus communication between processors in a distributed system [20]. They did not extend their results to other resources such as distributed array access or contention for FPGAs by alternative configurations sharing the same hardware.

The embedded systems community is primarily concerned with partitioning software functionality between one or more CPUs and non-reconfigurable circuits fabricated as ASICs. Yen and Wolf typify this group, which includes Buck, Gupta, Ernst, and Chiodo. The configurable computing engine community also investigates problems associated with realizing software as circuits. Most of this research involved reconfiguration only as programs are loaded and thus bears strong similarity to the ASIC work in the embedded system community. Athanas, Gokhale, and Hoare typify this approach. The next level of reconfigurability is dynamic reconfiguration, investigated by Hutchings, in which the contents of the FPGA are switched during execution. There is no existing work which addresses hierarchical reconfiguration.

Yen and Wolf, Lehoczky et al., and Peng and Shin are concerned with guaranteed latency bounds. The only research that treats the real-time behavior of non-software objects is that of Lehoczky and Yen. Lehoczky treated inter-processor busses as a real-time resource that must be shared. Yen and Wolf are included because they treated ASICs as real-time objects, although these were not shared and consequently had trivial real-time behavior.

All of the work that dealt extensively with partitioning, if it mentioned program structure at all, stated that acyclic graphs are the format of the program components that are manipulated. Acyclic graphs simplify the algorithms that manipulate data structures (this likely accounts for their widespread use: Yen and Wolf, Peng and Shin, Gupta et al., Ernst et al., Stone, and Bokhari). But such graphs limit the granularity of the program objects manipulated to high level modules.

Henkel and Ernst examined the use of multiple heuristics for partitioning software between CPU execution and a co-processor [10]. This work was motivated by the observation that particular heuristic rules work well at certain granularities, but not others. The recognition of granularity is important because programs behave differently at different scales. There is no effort by Henkel and Ernst to optimize placement of pieces designated for hardware execution.

Most projects (Athanas, Hutchings, Gokhale, Hoare, Buck, Gupta, Ernst, and Chiodo) did not address automatic partitioning, relying instead on the programmer to designate assignments of code to execution units. Of those who undertook automatic partitioning, Peng and Shin, Stone, and Bokhari divided the work load up between CPUs. Of those who addressed partitions between hardware and software, Gupta and Ernst were limited by their approach to systems of one CPU and one ASIC. Only Yen and Wolf dealt with partitions among multiple CPUs and ASICs.

Existing research does not treat computing resources as hierarchical collections of reconfigurable objects. This treatment will not only systematize the generation of reconfigurable designs, but also unify the disparate ideas in conventional computing. Most of the existing techniques for analyzing programs for mapping into networks are only valid for acyclic graphs. A more general technique that deals with looping behavior is needed. In order to map programs into actual hardware, it will be necessary to account for the effects of competing accesses to shared objects. No existing work has generalized the real-time scheduling techniques to shared objects like arrays or common subroutines.

Our Prior Research

Our existing work [3] addresses the mapping of systolic software into networks of execution resources. The key idea is that both software and hardware can be organized in hierarchic domains based on bandwidth of communication. Hardware tends to be packaged in units that naturally reflect this. Signals on a chip are nearly always faster than signals going off-chip. Communication between chips on a board is usually faster than messages to other boards. But even when the boundaries between higher and lower bandwidth communication domains do not correspond to physical packaging, they are nonetheless real. Software also exhibits this characteristic in that some portions of the code will inherently communicate more frequently. Thus software also can be analyzed and a hierarchical domain structure developed based on inherent communication frequencies. Good performance depends on mapping high bandwidth software domains into hardware domains that can support them.

A number of relevant articles exist. These are given below, preceded by a reference number which is used to cite a specific article throughout this application:

-   [1] Athanas, P. M., and Silverman, H. F., Processor Reconfiguration Through Instruction-Set Metamorphosis, Computer, Vol 26, pp 11-18, March 1993.
-   [2] Buck, J., Ha, S., Lee, E. A., and Messerschmitt, D. G., Ptolemy: A Framework for Simulating and Prototyping Heterogeneous Systems, International Journal of Computer Simulation, January 1994.
-   [3] Campbell, J. D., and Abbott, B., Gear Train Theory: An Approach to the Assignment Problem Providing Tractable Solutions with Measured Optimality, International Conference on Parallel and Distributed Processing Techniques and Applications, Vol II, pp 986-95, Jun. 30-Jul. 3, 1997.
-   [4] Campbell, J. D., Experience with a Reconfigurable Java Machine, International Conference on Parallel and Distributed Processing Techniques and Applications, pp 2459-66, Jun. 26-29, 2000.
-   [5] Chiodo, M., Giusto, P., Jurecska, A., Hsieh, H. C., Sangiovanni-Vincentelli, A., and Lavagno, L., Hardware-Software Codesign of Embedded Systems, IEEE Micro, 14(4):26-36, August 1994.
-   [6] Davis, A., and Nowick, S. M., An Introduction to Asynchronous Circuit Design, University of Utah Technical Report UUCS-97-013, September 1997.
-   [7] Ernst, R., Henkel, J., and Benner, T., Hardware-Software Co-Synthesis for Microcontrollers, IEEE Design & Test of Computers, 10(4), December 1993.
-   [8] Fleischmann, J., and Buchenrieder, K., Prototyping Networked Embedded Systems, Computer, Vol 32, No 2, pp 116-19, February 1999.
-   [9] Gokhale, M., Holmes, W., Kopser, A., Lucas, S., Minnich, R., Sweely, D., and Lopresti, D., Building and Using a Highly Parallel Programmable Logic Array, IEEE Computer, January 1991, pp 81-89.
-   [10] Henkel, J., and Ernst, R., An Approach to Automated Hardware/Software Partitioning Using a Flexible Granularity that is Driven by High-Level Estimation Techniques, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol 9, No 2, April 2001, pp 273-289.
-   [12] Hennessy, J. L., and Patterson, D. A., Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, Inc., pp 371-380, 1990.
-   [13] Hoare, C. A. R., and Page, I., Hardware and Software: The Closing Gap, Transputer Communications, Vol 2, June 1994, pp 69-90.
-   [14] http://oss.software.ibm.com/developerworks/opensource/jikes/
-   [15] http://www.jhdl.com/release-latest/docs/overview/intro.html
-   [16] Hutchings, B., and Wirthlin, M. J., A Dynamic Instruction Set Computer, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pp 92-103, April 1995.
-   [17] Hwang, K., and Xu, Z., Multipipeline Networking for Compound Vector Processing, IEEE Transactions on Computers, Vol 37, No 1, January 1988, pp 33-47.
-   [18] Johnson, K. T., Hurson, A. R., and Shirazi, B., General Purpose Systolic Arrays, IEEE Computer, November 1993, pp 20-31.
-   [19] Kung, H. T., Why Systolic Architectures?, IEEE Computer, January 1982, pp 37-46.
-   [20] Lehoczky, J., and Sha, L., Performance of Real-Time Bus Scheduling Algorithms, ACM Performance Review, May 1986.
-   [21] Leung, J. Y.-T., and Whitehead, J., On the Complexity of Fixed-Priority Scheduling of Periodic, Real-Time Tasks, Performance Evaluation, 2:237-250, 1982.
-   [22] Mangione-Smith, W. H., Seeking Solutions in Configurable Computing, Computer, Vol 30, pp 38-43, December 1997.
-   [23] Meyer, J., and Downing, T., Java Virtual Machine, O'Reilly, 1997.
-   [24] Narayanaswamy, P., Dynamic Arithmetic-Logic Unit Cache, Masters Thesis, Dept. of Electrical Eng., Utah State University, 1999.
-   [25] Peng, D.-T., and Shin, K. G., Optimal Scheduling of Cooperative Tasks in a Distributed System Using an Enumerative Method, IEEE Transactions on Software Engineering, Vol 19, March 1993, pp 253-67.
-   [26] Stone, H. S., Multiprocessor Scheduling with the Aid of Network Flow Algorithms, IEEE Transactions on Software Engineering, Vol SE-3, No 1, January 1977.
-   [27] Sutherland, I. E., Micropipelines, Communications of the ACM, Vol 32, No 6, pp 720-738, June 1989.
-   [28] Wirthlin, M. J., and Hutchings, B. L., Improving Functional Density Through Run-Time Constant Propagation, Field Programmable Gate Array Workshop, pp 86-92, 1997.
-   [29] Yen, T., and Wolf, W., Hardware-Software Co-Synthesis of Distributed Embedded Systems, Kluwer Academic Publishers, 1996.
-   [30] Constraints Guide, Xilinx, Inc., 2001.
-   [31] Development System Reference Guide, Xilinx, Inc., 2001.

Relevant prior patents include the following U.S. Pat. Nos. 5,834,957; 5,841,298; 6,044,457; 6,289,488; and 6,230,303.

DISCLOSURE OF INVENTION

The Trebuchet runs hardware compiled from software source code. In the present implementation, Java is the source code language of application programs to be run on the Trebuchet. We modified Jikes [14], an open source Java compiler originally from IBM, to include extra information we needed for the conversion to hardware. The output of Jikes is a standard Java class file. We obtain profile information from a modified JVM (Java_g, part of the Sun Java JDK). The profile also includes segmentation of the Java byte-codes into basic blocks and descriptors for the structure of for loops, if statements, etc.

There is considerable opportunity for fine grained parallelism. While parallelism is, in principle, possible to detect automatically, we added the keyword par to the syntax parsed by Jikes. Par signifies to the VHDL translator that a ‘for’ loop is vectorizable.

The Java byte-codes are translated to VHDL by analyzing the basic block contents. Stack and memory references become accesses to wires (thus being essentially compiled out) and successive op-codes become, for the most part, cascaded blocks of combinational logic. Array accesses become accesses to RAM. Since each array resides in its own RAM, concurrent access to different arrays is supported. Concurrent access to the same array must be arbitrated across the program.

We targeted our hardware at the Xilinx V800 FPGA. With a capacity of 800,000 gates, there is room for moderate size software experiments. In the future we intend to generalize Trebuchet to address designs involving multiple chips so that programs of arbitrary size may be executed. Hardware configuration files are generated by standard Xilinx tools [31].

The most significant current aspects of the present patent application are, though, the loops discussed below and utilizing a system of finite state machines built with synchronous logic for controlling the flow of data through computational logic circuits programmed to accomplish a task specified by a user, having one finite state machine associated with each computational logic circuit, having each finite state machine accept data from either one or more predecessor finite state machines or from one or more sources outside the system and furnish data to one or more successor finite state machines or a recipient outside the system, excluding from consideration in determining a clock period for the system logic paths performing the task specified by the user, and providing a means for ensuring that each finite state machine allows sufficient time to elapse after the computational logic circuit associated with that finite state machine has obtained input data that all signals generated within such computational logic circuit in response to such input data have propagated through such computational logic circuit before communication is permitted to occur from such computational logic circuit to a subsequent computational logic circuit.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 depicts a pipeline consisting of handshaking FSMs and computational logic. Each FSM exchanges signals with the prior and succeeding stages, and supplies a signal to the computational logic for latching input values. A software basic block is a sequence of instructions that contains no branching. A super block is a sequence of basic blocks, each of which may be conditionally executed. The particulars of the computational logic are derived from super blocks in the application to be executed.

FIG. 2 depicts the structure of descriptor tags embedded in the Java byte-stream by the modified Jikes compiler. The compiler never generates a ‘jump-to-self’ instruction. Thus a descriptor tag can be recognized in the byte-stream by a jump followed by a jump-to-self. This allows Trebuchet code to execute correctly on conventional JVMs. Execution simply jumps over it.

FIG. 3 shows an example Java fragment translated to a Trebuchet stage. The computational logic comprises latches to capture the input values, wires conducting the value of the signal j into an adder, and the resulting signal k propagating to the next stage.

FIG. 4 shows the layout of a vectorized for-loop. A tight loop initiates successive waves of execution that propagate through the pipe. It is the execution of vectorized loops that lends the Trebuchet its performance advantage and pseudo-asynchronous attributes.

FIG. 5 illustrates the control of conditional elements. The signal representing the truth value of the test condition switches a multiplexer. Either the computed results or the input values propagate forward based on the selection made by the multiplexer.

MODES FOR CARRYING OUT THE INVENTION

Modifications to the Jikes Java Compiler

The Java code is compiled by a modified version of the Java compiler Jikes. We added keywords to the otherwise standard Java syntax recognized by Jikes. As noted above, in principle the compiler could have been modified to recognize parallelizable and systolizable loops (see [12]). At some point in the future, we intend to do this.

In addition to par, we also included the keywords netstart, netend, and expose. Netstart and netend indicate respectively the beginning and end of the code to be analyzed for hardware mapping. Expose designates variables that are required as output from the execution engine. All of these represent expediencies that could, in principle, be automatically recognized by a compiler.

The structures which may be inserted into the Java class files include start tags, stop tags, and parallel loop descriptor tags. The latter designate ‘for’ loops for vectorization. The byte code interpreter has special code added to it to detect this extra information in the execution stream of the program. Because it is desirable that code thus modified also be executable by conventional JVM platforms, the tags are structured so that a conventional JVM will simply jump around them.

We needed to format the tags so that they could be unambiguously recognized in the JVM execution stream. The compiler never generates a jump-to-self instruction, so tags are constructed as a jump followed by a jump-to-self, followed by tag specific information. See FIG. 2.

Modifications to the Java Virtual Machine

One of the purposes of the Trebuchet is to experiment with mapping of regions of software onto hardware regions. The theory of this mapping, published in [3], depends on profile information with the number of times communication arcs are utilized, rather than the number of times nodes are visited, as in a standard profile. Consequently the JVM was altered to collect transfer of control statistics. Java_g was tailored to output a file of bytecodes segmented into basic blocks (basic blocks are sequences of code which terminate at program jumps).

Bytecode Translation to VHDL

Trebuchet, written in Common Lisp, translates the basic block and profile information provided by the modified JVM. Trebuchet symbolically traverses each basic block, generating combinational logic corresponding to the sequence of instructions. Some instructions (e.g., multiplication) are impractical to configure as purely combinational logic and necessitate further segmentation. Trebuchet also constructs hierarchical components, such as ‘for’ loops, that consist of a controlling FSM (finite state machine) and other subcomponents.

A basic block is a sequence of code unbroken by changes in sequential flow (except at the terminus). Thus there are not multiple paths of execution within a basic block. Trebuchet traverses a basic block, examining each instruction. Byte codes that manipulate memory (either stack or variable store) such as IPUSH rearrange the set of working wires. Operations that produce values (e.g., IADD) take their inputs from the set of working wires (deleting them from the set) and introduce new wires with the outputs. Trebuchet translates each basic block to a combinational net of logic. FIG. 3 shows an example basic block in source code and as hardware logic.
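
By way of illustration, the VHDL below sketches the kind of stage that FIG. 3 describes, assuming a one-statement basic block such as k = j + 1. The entity and signal names are invented for exposition; the generated code differs in detail.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical combinational stage for a basic block such as "k = j + 1".
-- The controlling FSM asserts 'capture' to latch the inputs; the sum then
-- propagates freely through the adder toward the next stage.
entity stage_example is
  port (
    clk     : in  std_logic;
    capture : in  std_logic;               -- latch signal from the FSM
    j_in    : in  signed(31 downto 0);     -- working wire from the prior stage
    k_out   : out signed(31 downto 0));    -- working wire to the next stage
end entity;

architecture rtl of stage_example is
  signal j_held : signed(31 downto 0);
begin
  latch_inputs : process(clk)
  begin
    if rising_edge(clk) then
      if capture = '1' then
        j_held <= j_in;
      end if;
    end if;
  end process;

  k_out <= j_held + 1;   -- purely combinational logic between latches
end architecture;
```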

Many compilers manipulate stacks internally and map stack locations to registers. Pushing an object on the stack, while conceptually moving the entire stack, in reality only changes the association between register names and stack offsets. Trebuchet does this with the set of wires representing program memory and the stack.

Vectorized Loops

Because the algorithms we want to investigate with Trebuchet manipulate programs that stream data through operators, we needed a way to generate code that could execute systolically. Java does not have a parallel operator, so we added one.

Conceptually, par for loops have four parts. These include the initialization_clause, the end_test, and the step_clause common to traditional for loops. The loop_body is run overlapped. As FIG. 4 shows, a par for loop is organized with a tight control loop that repetitively steps the loop variable, tests the termination condition, and initiates an entry into the pipelined loop body. All three of the control steps are executed in parallel to minimize the latency between subsequent pipeline entries. The pipeline initiation may be thought of as a thread that executes a particular loop iteration. This construction is not far removed from loop vectorization, a well studied topic in computer science [12], and could, in principle, have been accomplished automatically. But for our purposes, it is enough to use the par keyword to designate code that can be validly overlapped in execution.
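
A minimal VHDL sketch of such a control loop follows, assuming a loop of the form par for (i = 0; i < LIMIT; i++) and a simple req/ack handshake with the first stage of the loop body. All names are illustrative, and the unrolled pre-computation of indices described below is omitted.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity par_loop_ctl is
  generic (LIMIT : integer := 100);          -- end test: i < LIMIT (assumed)
  port (
    clk, reset : in  std_logic;
    req        : out std_logic;              -- offer an iteration to stage 1
    ack        : in  std_logic;              -- stage 1 has accepted it
    i_out      : out signed(31 downto 0);    -- loop variable for this thread
    last_out   : out std_logic);             -- flags the final iteration
end entity;

architecture rtl of par_loop_ctl is
  signal i       : signed(31 downto 0) := (others => '0');
  signal running : std_logic := '1';
begin
  i_out    <= i;
  last_out <= '1' when i = LIMIT - 1 else '0';   -- end-test flag carried down the pipe
  req      <= running;

  step : process(clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        i       <= (others => '0');
        running <= '1';
      elsif running = '1' and ack = '1' then     -- one iteration injected
        if i = LIMIT - 1 then
          running <= '0';                        -- final iteration has entered
        else
          i <= i + 1;                            -- step clause
        end if;
      end if;
    end if;
  end process;
end architecture;
```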

Trebuchet generates a vectorized loop from the body of the Java ‘for’ loop and the control clauses specified with it. It rearranges the controls and forms a pipeline from the succession of basic blocks in the body of the loop. The test condition, the stepping of loop variables, and the first stage of the pipe are all initiated in parallel.

Each iteration depends on the ‘current’ loop variables, as does the end test. Since these are all executed in parallel, the initiation loop is unrolled to pre-compute the data. Additionally, each iteration propagating down the pipelined loop body carries forward a flag that signals the final iteration of the loop. The loop test, in Java, is intended as a condition for breaking out of the loop, and thus signals on completion of the final iteration.

The end condition must be propagated down the pipe because of the behavior of the last pipeline stage. The last stage consumes each thread until signaled that the last iteration has arrived. In this special case, it handshakes its results out to whatever follows the loop.

Since each iteration propagates an end test value corresponding to the next iteration, and the test itself depends on stepped variables, each cycle must pre-compute index variables that are two iterations ahead and a test value that is one iteration ahead. The control loop is unrolled to obtain these phase relationships. One consequence of this is that, like old style FORTRAN do loops, par for loops must be guaranteed to execute at least once. Another restriction on valid par for loops is that step and test clauses must not have side effects or access arrays.

Conditional Code

The Java compiler handles conditionally executed code by jumping around it. In vectorized code, successive executions (threads) must not be allowed to overtake and pass prior threads. Consequently, Trebuchet threads propagate from stage to stage, even where execution is suppressed. The boolean test result propagates through the range of the conditional execution. At each stage, it suppresses update of the values passed to the next stage by switching a multiplexer. FIG. 5 illustrates this operation. Similar measures have been taken by makers of vector processing units for conventional computers [12]. Note that even though pipeline threads must traverse stages for which execution is suppressed, such traversal is rapid because the controller FSM immediately transfers to output states without waiting for signals to propagate through the computational logic.
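
The forwarding multiplexer of FIG. 5 can be sketched in VHDL as follows; the entity and port names are invented for exposition.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustrative forwarding multiplexer for a conditional stage (FIG. 5).
entity cond_mux is
  generic (WIDTH : integer := 32);
  port (
    cond     : in  std_logic;                           -- boolean test result
    computed : in  std_logic_vector(WIDTH-1 downto 0);  -- conditional block output
    passthru : in  std_logic_vector(WIDTH-1 downto 0);  -- unmodified stage inputs
    result   : out std_logic_vector(WIDTH-1 downto 0));
end entity;

architecture rtl of cond_mux is
begin
  -- When execution is suppressed the inputs propagate forward unchanged,
  -- so later threads can never overtake earlier ones.
  result <= computed when cond = '1' else passthru;
end architecture;
```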

Array References

Java arrays are dynamic in the sense that they may be created at any time, moved around as needed to optimize garbage collection, and reclaimed by the garbage collector when abandoned. We did not want to subject the hardware generated by Trebuchet to the performance penalties inherent in such manipulation, so we chose to map array creations to static arrays created in the FPGA at configuration time.

If, during the course of symbolic modeling of the stack and variable store, a reference is made to a location identified with a particular array, the mechanism to access that array is constructed. If the array reference cannot be ascertained, the module is marked as not being compatible with realization as hardware, and must consequently be executed from bytecodes by the JVM.

Pseudo-Asynchronous Execution

Trebuchet, when generating VHDL, calculates the length of the logic path for each pipeline stage and generates a FSM controller with enough wait states to allow signals to propagate. In order to decouple the clock period from this path, Trebuchet generates multi-cycle clock specifications for the Xilinx tools [30]. This allows the clock period to be driven by the exchange of handshaking signals rather than by the critical path through the combinational logic. This means that for lightly loaded conditions, the average stage delay dominates the pipe transit time, instead of the worst case stage delay. This is one of the advantages touted for the asynchronous micropipeline [27].
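
A hedged sketch of such a stage controller appears below, with two wait states shown explicitly; the handshake protocol is simplified and all names are illustrative. The multi-cycle timing constraints themselves are supplied separately to the Xilinx tools [30].

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity stage_fsm is
  port (
    clk, reset : in  std_logic;
    req_in     : in  std_logic;   -- predecessor offers data
    ack_in     : out std_logic;   -- this stage has taken the data
    req_out    : out std_logic;   -- results offered to the successor
    ack_out    : in  std_logic;   -- successor has taken the results
    capture    : out std_logic);  -- latch enable for the computational logic
end entity;

architecture rtl of stage_fsm is
  -- Two wait states are shown; the generator emits as many as the
  -- computed logic-path length requires.
  type state_t is (IDLE, WAIT1, WAIT2, HANDOFF);
  signal state : state_t := IDLE;
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= IDLE;
      else
        case state is
          when IDLE =>                      -- wait for the predecessor
            if req_in = '1' then
              state <= WAIT1;
            end if;
          when WAIT1   => state <= WAIT2;   -- signals still propagating
          when WAIT2   => state <= HANDOFF; -- logic has settled
          when HANDOFF =>                   -- handshake with the successor
            if ack_out = '1' then
              state <= IDLE;
            end if;
        end case;
      end if;
    end if;
  end process;

  capture <= '1' when (state = IDLE and req_in = '1') else '0';
  ack_in  <= '1' when state = WAIT1 else '0';
  req_out <= '1' when state = HANDOFF else '0';
end architecture;
```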

Another desirable trait of asynchronous circuitry is that, without a synchronizing clock, logic transitions are very well distributed in time. This minimizes current draw on the power supply and reduces the level of radiated EMI (electromagnetic interference) [6]. The tendency of synchronous logic is to have a well defined signature of successive gates transitioning (and drawing power). It is expected that heavy usage of multi-cycle logic paths will have the effect of smearing these signatures, thus obtaining some of the advantage of purely asynchronous circuitry.

Drawbacks of asynchronous circuits include sensitivity to signal noise, performance dependence on temperature and process variations, and incompatibility with conventional FPGA tools [6]. The Trebuchet avoids these difficulties by being, at heart, synchronous. It combines the best aspects of both worlds.

Preferably, the invention is machinery configured into a FPGA. The translator (compiler) should recognize vectorizable loops without explicit designation in the application source code. This avoids the necessity of designating in application source code those structures that may be pipelined.

Successive conditional blocks (code that may or may not be executed as controlled by an IF statement) should be cascaded as a single pipeline stage under the control of a single FSM. This lengthens the logic signal paths driven by clock signals, giving better overlapping. It also minimizes the number of FSMs, allowing limited on-chip resources to be used for computational circuitry instead of the overhead necessary for constructing FSMs.

And a preferred system of finite state machines is built with synchronous logic, although asynchronous logic is acceptable, for controlling the flow of data through computational logic circuits programmed to accomplish a task specified by a user, having one finite state machine associated with each computational logic circuit, having each finite state machine accept data from either one or more predecessor finite state machines or from one or more sources outside the system and furnish data to one or more successor finite state machines or a recipient outside the system, excluding from consideration in determining a clock period for the system logic paths performing the task specified by the user, and providing a means for ensuring that each finite state machine allows sufficient time to elapse after the computational logic circuit associated with that finite state machine has obtained input data that all signals generated within such computational logic circuit in response to such input data have propagated through such computational logic circuit before communication is permitted to occur from such computational logic circuit to a subsequent computational logic circuit.

In one embodiment, an alternative to using wait states for ensuring that each finite state machine allows sufficient time to elapse after the computational logic circuit associated with that finite state machine has obtained input data that all signals generated within such computational logic circuit in response to such input data have propagated through such computational logic circuit before communication is permitted to occur from such computational logic circuit to a subsequent computational logic circuit is a count-down timer wherein a register is set to a sufficient number of clock cycles for the computational logic circuitry to perform its task and decremented with each clock until reaching zero.

In another alternate embodiment, the alternative to using wait states for ensuring that each finite state machine allows sufficient time to elapse after the computational logic circuit associated with that finite state machine has obtained input data that all signals generated within such computational logic circuit in response to such input data have propagated through such computational logic circuit before communication is permitted to occur from such computational logic circuit to a subsequent computational logic circuit is a count-up timer wherein a register is set to zero and increased with each clock until reaching a sufficient number of clock cycles for the computational logic circuitry to perform its task.
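
The following VHDL sketches the count-down embodiment; the count-up variant merely runs the counter in the other direction. The component name and generic are illustrative.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity settle_timer is
  generic (CYCLES : natural := 8);  -- clocks needed for the logic to settle
  port (
    clk  : in  std_logic;
    load : in  std_logic;           -- pulsed when the stage latches new inputs
    done : out std_logic);          -- high once the interval has elapsed
end entity;

architecture rtl of settle_timer is
  signal count : natural range 0 to CYCLES := 0;
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if load = '1' then
        count <= CYCLES;            -- count-down embodiment; the count-up
      elsif count /= 0 then         -- variant starts at zero and increments
        count <= count - 1;         -- until it reaches CYCLES
      end if;
    end if;
  end process;

  done <= '1' when count = 0 else '0';
end architecture;
```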

INDUSTRIAL APPLICABILITY

The way in which the System of Finite State Machines is capable of exploitation in industry and the way in which the System of Finite State Machines can be made and used are obvious from the description and the nature of the System of Finite State Machines.

CLAIMS

1. In a system of finite state machines built with synchronous logic for controlling the flow of data through computational logic circuits programmed to accomplish a task specified by a user, having one finite state machine associated with each computational logic circuit, having each finite state machine accept data from either one or more predecessor finite state machines or from one or more sources outside the system and furnish data to one or more successor finite state machines or a recipient outside the system, excluding from consideration in determining a clock period for the system logic paths performing the task specified by the user, and providing a means for ensuring that each finite state machine allows sufficient time to elapse after the computational logic circuit associated with that finite state machine has obtained input data that all signals generated within such computational logic circuit in response to such input data have propagated through such computational logic circuit before communication is permitted to occur from such computational logic circuit to a subsequent computational logic circuit.
2. The system as recited in claim 1, wherein: the means for ensuring that each finite state machine allows sufficient time to elapse after the computational logic circuit associated with that finite state machine has obtained input data that all signals generated within such computational logic circuit in response to such input data have propagated through such computational logic circuit before communication is permitted to occur from such computational logic circuit to a subsequent computational logic circuit is a count-down timer wherein a register is set to a sufficient number of clock cycles for the computational logic circuitry to perform its task and decremented with each clock until reaching zero.
3. The system as recited in claim 1, wherein: the means for ensuring that each finite state machine allows sufficient time to elapse after the computational logic circuit associated with that finite state machine has obtained input data that all signals generated within such computational logic circuit in response to such input data have propagated through such computational logic circuit before communication is permitted to occur from such computational logic circuit to a subsequent computational logic circuit is a count-up timer wherein a register is set to zero and increased with each clock until reaching a sufficient number of clock cycles for the computational logic circuitry to perform its task.
4. The system as recited in claim 1, wherein: the means for ensuring that each finite state machine allows sufficient time to elapse after the computational logic circuit associated with that finite state machine has obtained input data that all signals generated within such computational logic circuit in response to such input data have propagated through such computational logic circuit before communication is permitted to occur from such computational logic circuit to a subsequent computational logic circuit is configuring each finite state machine with sufficient wait states for all signals generated within such computational logic circuit in response to such input data to propagate through such computational logic circuit before communication is permitted to occur from such computational logic circuit to a subsequent computational logic circuit.
5. In a system of finite state machines built with asynchronous logic for controlling the flow of data through computational logic circuits programmed to accomplish a task specified by a user, having one finite state machine associated with each computational logic circuit, having each finite state machine accept data from either one or more predecessor finite state machines or from one or more sources outside the system and furnish data to one or more successor finite state machines or a recipient outside the system, excluding from consideration in determining a clock period for the system logic paths performing the task specified by the user, and providing a means for ensuring that each finite state machine allows sufficient time to elapse after the computational logic circuit associated with that finite state machine has obtained input data that all signals generated within such computational logic circuit in response to such input data have propagated through such computational logic circuit before communication is permitted to occur from such computational logic circuit to a subsequent computational logic circuit.